Abstract: We propose, without loss of generality, strategies to
achieve a high-throughput FPGA-based architecture for a QC-LDPC
code based on a circulant-1 identity matrix construction. We
present a novel representation of the parity-check matrix (PCM)
that provides a multi-fold throughput gain. Splitting the node
processing algorithm enables us to pipeline blocks and hence
layers. By partitioning the PCM into not only layers but also
superlayers, we derive an upper bound on the pipelining depth for
the compact representation. To validate the architecture, a
decoder for the IEEE 802.11n (2012) [1] QC-LDPC code is
implemented on the Xilinx Kintex-7 FPGA with the help of the FPGA
IP compiler [2] available in the NI LabVIEW™ Communication System
Design Suite (CSDS™), which offers an automated and systematic
compilation flow. An optimized hardware implementation of the
LDPC algorithm was generated in approximately 3 minutes,
achieving an overall throughput of 608 Mb/s (at 260 MHz). To the
best of our knowledge, this is the fastest implementation of the
IEEE 802.11n QC-LDPC decoder using an algorithmic compiler.

Index Terms: 5G, mm-wave, QC-LDPC, Belief Propagation (BP)
decoding, MSA, layered decoding, high-level synthesis (HLS),
FPGA, IEEE 802.11n.

I. Introduction 

For the next generation of wireless technology, collectively
termed Beyond-4G and 5G (hereafter referred to as 5G), peak data
rates of up to ten Gb/s with an overall latency of less than 1 ms
[3] are envisioned. However, due to the proposed operation in the
30-300 GHz range, with challenges such as short range of
communication, increasing shadowing and rapid fading in time, the
processing complexity of the system is expected to be high. In an
effort to design and develop a channel coding solution suitable
for such systems, in this paper we present a high-throughput,
scalable and reconfigurable FPGA decoder architecture.

It is well known that the structure offered by QC-LDPC codes [4]
makes them amenable to time- and space-efficient decoder
implementations relative to random LDPC codes. We believe that,
given the primary requirement of high decoding throughput,
QC-LDPC codes or their variants (such as accumulator-based codes
[5]) that can be decoded using belief propagation (BP) methods
are highly likely candidates for 5G systems. Thus, for the sole
purpose of validating the proposed architecture, we chose a
standard-compliant code, with a throughput performance that well
surpasses the requirement of the chosen standard.


Insightful work on high-throughput (order of Gb/s) BP-based
QC-LDPC decoders is available; however, most such works focus on
an ASIC design [6], [7], which usually requires intricate
customizations at the RTL level and expert knowledge of VLSI
design. A sizeable subset of these caters to fully-parallel [8]
or code-specific [9] architectures. From the point of view of an
evolving research solution, this is not an attractive option for
rapid prototyping. In the relatively less explored area of
FPGA-based implementation, impressive results have recently been
presented in works such as [10], [11] and [12]. However, these
are based on fully-parallel architectures, which lack flexibility
(they are code specific) and are limited to small block sizes
(primarily due to the inhibiting routing congestion), as
discussed in the informative overview in [13]. Since our case
study is based on fully-automated generation of the HDL, a fair
comparison is made with another state-of-the-art implementation
[14] in Section IV. Moreover, in this paper we provide, without
loss of generality, strategies to achieve a high-throughput
FPGA-based architecture for a QC-LDPC code based on a circulant-1
identity matrix construction.

The main contribution of this brief is a compact representation
(in matrix form) of the PCM of the QC-LDPC code, which provides a
multi-fold increase in throughput. In spite of the resulting
reduction in the degrees of freedom for pipelined processing, we
achieve efficient pipelining of two layers and also provide,
without loss of generality, an upper bound on the pipelining
depth that can be achieved in this manner. The splitting of the
node processing allows us to achieve the said degree of
pipelining without utilizing additional hardware resources. The
algorithmic strategies were realized in hardware for our case
study by the FPGA IP compiler [2] in LabVIEW™ CSDS™, which
translated the entire software-pipelined high-level language
description into VHDL in approximately 3 minutes, enabling
state-of-the-art rapid prototyping. We have also demonstrated the
scalability of the proposed architecture in an application that
achieves over 2 Gb/s of throughput [15].

The remainder of this paper is organized as follows. Section II
describes the QC-LDPC codes and the decoding algorithm chosen for
this implementation. The strategies for achieving high throughput
are explained in Section III. The case study is discussed in
Section IV, and we conclude with Section V.




Figure 1: A Tanner graph where VNs (representing the code bits)
are shown as circles and CNs (representing the parity-check
equations) are shown as squares. Each edge in the graph
corresponds to a non-zero entry (1 for binary LDPC codes) in the
PCM $H$.


II. Quasi-Cyclic LDPC Codes and Decoding 

LDPC codes (due to R. Gallager [16]) are a class of linear block
codes that have been shown to achieve near-capacity performance
on a broad range of channels and are characterized by a
low-density (sparse) PCM representation.

Mathematically, an LDPC code is the null space of its $m \times n$
PCM $H$, where $m$ denotes the number of parity-check equations
or parity bits and $n$ denotes the number of variable nodes or
code bits [?]. In other words, for a rank-$m$ PCM $H$, $m$ is the
number of redundant bits added to the $k$ information bits, which
together form the codeword of length $n = k + m$. In the Tanner
graph representation (due to Tanner [?]), $H$ is the incidence
matrix of a bipartite graph comprising two sets: the check node
(CN) set of $m$ parity-check equations and the variable node (VN)
set of $n$ variable or bit nodes; the $i$th CN is connected to
the $j$th VN if $H(i,j) = 1$, $1 \le i \le m$ and
$1 \le j \le n$. A toy example of a Tanner graph is shown in
Fig. 1. The degree $d_{c_i}$ ($d_{v_j}$) of a CN $i$ (VN $j$) is
equal to the number of 1s along the $i$th row ($j$th column) of
$H$. For constants $c_c, c_v \in \mathbb{Z}_{>0}$ with
$c_c \ll m$ and $c_v \ll n$, if $d_{c_i} = c_c$ and
$d_{v_j} = c_v$ for all $i, j$, then the LDPC code is called a
regular code; otherwise it is called an irregular code.
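As a minimal illustration of the degree definitions above, the following sketch computes the CN and VN degrees of a small parity-check matrix and tests whether the code is regular. The 3 x 6 matrix `H_TOY` and the helper names are hypothetical and are not taken from the case study; the sparsity requirements on the constants are not checked.

```python
# Toy parity-check matrix (hypothetical example, not from the paper).
H_TOY = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def cn_degrees(H):
    """Degree of each check node: number of 1s along each row."""
    return [sum(row) for row in H]

def vn_degrees(H):
    """Degree of each variable node: number of 1s along each column."""
    return [sum(col) for col in zip(*H)]

def is_regular(H):
    """True if all CN degrees are equal and all VN degrees are equal."""
    return len(set(cn_degrees(H))) == 1 and len(set(vn_degrees(H))) == 1

print(cn_degrees(H_TOY))   # row weights
print(vn_degrees(H_TOY))   # column weights
print(is_regular(H_TOY))   # the column weights differ, so this toy code is irregular
```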


A. Quasi-Cyclic LDPC Codes

The first LDPC codes by Gallager are random, which complicates
the decoder implementation, mainly because a random interconnect
pattern between the VNs and CNs directly translates to a complex
wire-routing circuit in hardware. QC-LDPC codes belong to the
class of structured codes that are relatively easier to implement
without significantly compromising performance.

The construction of identity-matrix-based QC-LDPC codes relies on
an $m_b \times n_b$ matrix $H_b$, sometimes called the base
matrix, whose entries describe cyclically right-shifted identity
and zero submatrices, both of size $z \times z$, where
$z \in \mathbb{Z}^+$, $1 \le i_b \le m_b$ and
$1 \le j_b \le n_b$, with the shift value

$$ s = H_b(i_b, j_b) \in S = \{-1\} \cup \{0, \ldots, z-1\}. $$

The PCM $H$ is obtained by expanding $H_b$ using the mapping

$$ s \mapsto \begin{cases} I_s, & s \in S \setminus \{-1\} \\ 0, & s \in \{-1\} \end{cases} $$

where $I_s$ is an identity matrix of size $z$ which is cyclically
right-shifted by $s = H_b(i_b, j_b)$ and $0$ is the all-zero
matrix of size $z \times z$. As $H$ is composed of the
submatrices $I_s$ and $0$, it has $m = m_b \cdot z$ rows and
$n = n_b \cdot z$ columns. $H_b$ for the IEEE 802.11n (2012)
standard [1] (used for our case study) with $z = 81$ is shown in
Table I.

B. Scaled Min-Sum Algorithm for Decoding QC-LDPC Codes

LDPC codes can be decoded using message passing (MP) or belief
propagation (BP) [16], [?] on the bipartite Tanner graph, where
the CNs and VNs communicate with each other, successively passing
revised estimates of the associated log-likelihood ratio (LLR) in
every decoding iteration. In this work we have employed the
efficient decoding algorithm presented in [?], with pipelined
processing of layers based on the row-layered decoding technique
[20], detailed in Section III.

Definition 1. For $1 \le i \le m$ and $1 \le j \le n$, let $v_j$
denote the $j$th bit in the length-$n$ codeword and
$y_j = v_j + n_j$ denote the corresponding received value from
the channel corrupted by the noise sample $n_j$. Let the
variable-to-check (VTC) message from VN $j$ to CN $i$ be $q_{ij}$
and let the check-to-variable (CTV) message from CN $i$ to VN $j$
be $r_{ij}$. Let the a posteriori probability (APP) ratio for VN
$j$ be denoted as $p_j$.

The steps of the scaled-MSA [?] are given below.

1) Initialize the APP ratio and the CTV messages as

$$ p_j^{(0)} = \ln\left\{ \frac{P(v_j = 0 \mid y_j)}{P(v_j = 1 \mid y_j)} \right\}, \quad 1 \le j \le n $$

$$ r_{ij}^{(0)} = 0, \quad 1 \le i \le m,\ 1 \le j \le n \tag{1} $$

2) Iteratively compute at the $t$th decoding iteration

$$ q_{ij}^{(t)} = p_j^{(t-1)} - r_{ij}^{(t-1)} \tag{2} $$


$$ r_{ij}^{(t)} = \alpha \cdot \prod_{k \in \mathcal{N}(i) \setminus \{j\}} \mathrm{sign}\left(q_{ik}^{(t)}\right) \cdot \min_{k \in \mathcal{N}(i) \setminus \{j\}} \left| q_{ik}^{(t)} \right| \tag{3} $$

$$ p_j^{(t)} = q_{ij}^{(t)} + r_{ij}^{(t)} \tag{4} $$

where $1 \le i \le m$, and $k \in \mathcal{N}(i) \setminus \{j\}$
ranges over the set of VN neighbors of CN $i$ excluding VN $j$.
Let $t_{max}$ be the maximum number of decoding iterations.

3) Decision on the code bit $v_j$, $1 \le j \le n$, as

$$ \hat{v}_j = \begin{cases} 0, & p_j \ge 0 \\ 1, & \text{otherwise} \end{cases} $$

4) If $\hat{v} H^T = 0$, where
$\hat{v} = (\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_n)$, or
$t = t_{max}$, declare $\hat{v}$ as the decoded codeword.

Table I: Base matrix $H_b$ for $z = 81$ specified in the IEEE
802.11n (2012) standard used in the case study. $L_1$ to $L_{12}$
are the layers and $B_1$ to $B_{24}$ are the block columns (see
Section III-C). Valid blocks (see Section III-D) are highlighted.


It is well known that, since the MSA is an approximation of the
sum-product algorithm (SPA), the performance of the MSA is
relatively worse than that of the SPA [?]. However, in [?] it has
been shown that scaling the CTV messages can improve the
performance of the MSA. Hence, we scale the CTV messages by a
factor $\alpha$ (= 0.75).
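The message updates of step 2 can be sketched for a single check node as follows. This is a minimal floating-point illustration, assuming the usual reading of equations (2)-(4) (VTC message as APP minus old CTV, scaled sign-min CTV update, APP refresh); the function names, the data layout and the treatment of zero as positive are assumptions made here, not details from the implementation.

```python
ALPHA = 0.75  # scaling factor stated in the text

def sign(x):
    """Sign of x; zero is treated as positive (assumption of this sketch)."""
    return -1.0 if x < 0 else 1.0

def update_check_node(p, r_old, neighbors, alpha=ALPHA):
    """One scaled min-sum update for a single CN.

    p         : list of APP values indexed by VN (updated in place)
    r_old     : dict VN index -> previous CTV message of this CN
    neighbors : VN indices connected to this CN
    Returns a dict with the new CTV messages.
    """
    # Equation (2): VTC messages from the current APP and old CTV values.
    q = {j: p[j] - r_old[j] for j in neighbors}
    r_new = {}
    for j in neighbors:
        others = [q[k] for k in neighbors if k != j]
        sgn = 1.0
        for v in others:
            sgn *= sign(v)
        # Equation (3): scaled product of signs times the minimum magnitude.
        r_new[j] = alpha * sgn * min(abs(v) for v in others)
    # Equation (4): refreshed APP values.
    for j in neighbors:
        p[j] = q[j] + r_new[j]
    return r_new

p = [2.0, -1.0, 4.0]   # made-up APP values for three VNs
r = update_check_node(p, {0: 0.0, 1: 0.0, 2: 0.0}, [0, 1, 2])
print(r, p)
```

The inner list comprehension makes the cost quadratic in the CN degree; it is written this way only for readability.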

Remark 1. The standard MP algorithm is based on the so-called
flooding or two-phase schedule, where each decoding iteration
comprises two phases. In the first phase, VTC messages for all
the VNs are computed and, in the second phase, the CTV messages
for all the CNs are computed, strictly in that order. Thus,
message updates from one side of the graph propagate to the other
side only in the next decoding iteration. In the algorithm given
in [?], however, message updates can propagate across the graph
in the same decoding iteration. This provides several advantages:
a single processing unit is required for both CN and VN message
updates, memory storage is reduced on account of the on-the-fly
computation of the VTC messages $q_{ij}$, and the algorithm
converges faster than the standard MP flooding schedule,
requiring fewer decoding iterations.

III. Techniques for High-throughput 

To understand the high-throughput requirements for LDPC 
decoding, let us first define the decoding throughput T of an 
iterative LDPC decoder. 

Definition 2. Let $F_c$ be the clock frequency, $n$ be the code
length, $N_i$ be the number of decoding iterations and $N_c$ be
the number of clock cycles per decoding iteration; then the
throughput of the decoder is given by

$$ T = \frac{F_c \times n}{N_i \times N_c} \ \text{b/s}. $$
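To make the role of each quantity in Definition 2 concrete, the sketch below evaluates the throughput under the assumption that it takes the usual form $T = F_c \, n / (N_i \, N_c)$. The function name and all numbers in the call are hypothetical and chosen only to exercise the formula; they are not measurements from the case study.

```python
def decoder_throughput(f_clk_hz, n_bits, n_iter, cycles_per_iter):
    """Throughput in bits/s, assuming T = F_c * n / (N_i * N_c)."""
    return f_clk_hz * n_bits / (n_iter * cycles_per_iter)

# Hypothetical operating point: the cycle count per iteration is made up.
t = decoder_throughput(f_clk_hz=200e6, n_bits=1944, n_iter=8, cycles_per_iter=100)
print(f"{t / 1e6:.1f} Mb/s")
```

Halving `cycles_per_iter` or doubling `f_clk_hz` doubles the result, which is why the techniques that follow target the clock rate and the cycle count per iteration.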

Even though $n$ and $N_i$ are functions of the code and the
decoding algorithm used, $F_c$ and $N_c$ are determined by the
hardware architecture. Architectural optimizations, such as the
ability to operate the decoder at higher clock rates with minimal
latency between decoding iterations, can help achieve higher
throughput. We have employed the following techniques to increase
the throughput given by Definition 2.

A. Linear Complexity Node Processing 

As noted in Section II-B, separate processing units for CNs and
VNs are not required, unlike in the flooding schedule. The
hardware elements that process equations (2)-(4) are collectively
referred to as the Node Processing Unit (NPU).

Careful observation reveals that, among equations (2)-(4),
processing the CTV messages $r_{ij}$, $1 \le i \le m$ and
$1 \le j \le n$, is the most computationally intensive due to the
calculation of the sign and the minimum value operations. The
complexity of processing the minimum value is $O(d_{c_i}^2)$. In
software, this translates to two nested for-loops, an outer loop
that executes $d_{c_i}$ times and an inner loop that executes
$(d_{c_i} - 1)$ times.

To achieve linear complexity $O(d_{c_i})$ for the minimum value
computation, we split the process into two phases or passes: the
global pass, where the first and the second minimum (the smallest
value in the set excluding the minimum value of the set) for all
the neighboring VNs of a CN are computed, and the local pass,
where the first and second minimum from the global pass are used
to compute the minimum value for each neighboring VN. Based on
the functionality of the two passes, the NPU is divided into the
Global NPU (GNPU) and the Local NPU (LNPU). The algorithm is
given below.

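1) Global Pass:

i. Initialization: Let $\ell$ denote the discrete time-steps such
that $\ell \in \{0\} \cup \{1, 2, \ldots, |\mathcal{N}(i)|\}$ and
let $f^{(\ell)}$ and $s^{(\ell)}$ denote the value of the first
and the second minimum at time $\ell$, respectively. The initial
value at time $\ell = 0$ is

$$ f^{(0)} = s^{(0)} = \infty \tag{6} $$

ii. Comparison: Let $k_i(\ell) \in \mathcal{N}(i)$,
$\ell = \{1, 2, \ldots, |\mathcal{N}(i)|\}$, denote the index of
the $\ell$th neighboring VN of CN $i$. Note that $k_i(\ell)$
depends on $i$ and $\ell$; specifically, for a given CN $i$ it is
a bijective function of $\ell$. An increment from $(\ell - 1)$ to
$\ell$ corresponds to moving from the edge
CN $i$ $\leftrightarrow$ VN $k_i(\ell - 1)$ to the edge
CN $i$ $\leftrightarrow$ VN $k_i(\ell)$.

$$ f^{(\ell)} = \begin{cases} |q_{i k_i(\ell)}|, & |q_{i k_i(\ell)}| \le f^{(\ell-1)} \\ f^{(\ell-1)}, & \text{otherwise} \end{cases} \tag{7} $$

$$ s^{(\ell)} = \begin{cases} |q_{i k_i(\ell)}|, & f^{(\ell-1)} < |q_{i k_i(\ell)}| < s^{(\ell-1)} \\ f^{(\ell-1)}, & |q_{i k_i(\ell)}| \le f^{(\ell-1)} \\ s^{(\ell-1)}, & \text{otherwise} \end{cases} \tag{8} $$

Thus, $f^{(\ell_{max})}$ and $s^{(\ell_{max})}$ are the first and
second minimum values for the set of VN neighbors of CN $i$,
where $\ell_{max} = |\mathcal{N}(i)|$.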

2) Local Pass: Let the minimum value as per equation (3) for VN
$k_i(\ell)$ be denoted as $q^{min}_{i k_i(\ell)}$,
$\ell \in \{1, 2, \ldots, |\mathcal{N}(i)|\}$; then

$$ q^{min}_{i k_i(\ell)} = \begin{cases} f^{(\ell_{max})}, & |q_{i k_i(\ell)}| \neq f^{(\ell_{max})} \\ s^{(\ell_{max})}, & \text{otherwise} \end{cases} \tag{9} $$

In software, this translates to two consecutive for-loops, each
executing $(d_{c_i} - 1)$ times. Consequently, this reduces the
complexity from $O(d_{c_i}^2)$ to $O(d_{c_i})$. A similar
approach is also found in [?]. The sign computation is processed
in a similar manner.
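The two passes can be sketched in software as two consecutive loops. The version below is an illustrative floating-point rendering, not the hardware description: `global_pass` tracks the first and second minimum in one sweep, `local_pass` derives the per-VN minimum from those two values only, and `min_excluding_self` is a quadratic reference added here purely to check the result. The input magnitudes are made up, and ties are resolved so that a repeated minimum also becomes the second minimum.

```python
import math

def global_pass(magnitudes):
    """Single sweep that tracks the first (f) and second (s) minimum."""
    f = s = math.inf                  # initial values at time step 0
    for x in magnitudes:
        if x <= f:                    # new smallest value: old f becomes s
            f, s = x, f
        elif x < s:                   # value falls between f and s
            s = x
    return f, s

def local_pass(magnitudes, f, s):
    """Minimum over all other neighbors of each VN, using only f and s."""
    return [f if x != f else s for x in magnitudes]

def min_excluding_self(magnitudes):
    """Reference O(d^2) computation, used only to check the two-pass result."""
    return [min(magnitudes[:k] + magnitudes[k + 1:])
            for k in range(len(magnitudes))]

mags = [0.9, 0.3, 1.7, 0.3, 2.2]      # |q| values for one CN (made up)
f, s = global_pass(mags)
assert local_pass(mags, f, s) == min_excluding_self(mags)
print(f, s, local_pass(mags, f, s))
```

Only two values per CN need to be carried from the first loop to the second, which is what allows the two loops to be mapped to separate units.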


[Table II body: rows NPU$_0$ to NPU$_{z-1}$, columns VN$_{zJ}$ to
VN$_{z(J+1)-1}$; the entries are the 0s and 1s of a cyclically
shifted identity submatrix.]

Table II: Arbitrary submatrix $I_s$ in $H$, $0 \le J \le n_b - 1$,
illustrating the opportunity to parallelize $z$ NPUs.


B. z-fold Parallelization of NPUs 

The CN message computation given by equation (3) is repeated $m$
times in a decoding iteration, i.e. once for each CN. A
straightforward serial implementation of this kind is slow and
undesirable. Instead, we apply a strategy based on the following
understanding.

Fact 1. An arbitrary submatrix I s in the PCM H corresponds 
to z CNs connected to z VNs on the bipartite graph, with 
strictly 1 edge between each CN and VN. 

This implies that no CN in this set of $z$ CNs given by $I_s$
shares a VN with another CN in the same set. Table II illustrates
such an arbitrary submatrix in $H$. This presents us with an
opportunity to operate $z$ NPUs in parallel (hereafter referred
to as an NPU array), resulting in a $z$-fold increase in
throughput.
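Fact 1 can be checked on a small example. The sketch below builds a cyclically right-shifted identity block and confirms that every CN row touches a different VN column; the sizes `z = 5` and `s = 2` and the helper names are illustrative assumptions, and the row-to-column convention (row `r` has its 1 in column `(r + s) mod z`) is the one assumed here for a right shift.

```python
def shifted_identity(z, s):
    """z x z identity matrix cyclically right-shifted by s (0 <= s < z)."""
    return [[1 if c == (r + s) % z else 0 for c in range(z)] for r in range(z)]

def vn_of_each_cn(block):
    """Column index of the single 1 in every row of a valid block."""
    return [row.index(1) for row in block]

z, s = 5, 2                      # small illustrative sizes (assumed)
block = shifted_identity(z, s)
vns = vn_of_each_cn(block)

# Every CN row touches a different VN column, so the z rows can be
# served by z NPUs at the same time without two NPUs sharing a VN.
print(vns)
print(len(set(vns)) == z)
```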


C. Layered Decoding 

From Remark 1 it is clear that, in the flooding schedule, all
nodes on one side of the bipartite graph can be processed in
parallel. Although such a fully parallel implementation may seem
an attractive option for achieving high-throughput performance,
it has its own drawbacks. Firstly, it quickly becomes intractable
in hardware due to the complex interconnect pattern. Secondly,
such an implementation usually restricts itself to a specific
code structure. In spite of the serial nature of the algorithm in
Section II-B, one can process multiple nodes at the same time if
the following condition is satisfied.


Fact 2. From the perspective of CN processing, two or 
more CNs can be processed at the same time (i.e. they are 
independent of each other) if they do not have one or more 
VNs (code bits) in common. 


The row-layering technique used in this work essentially relies
on the above condition being satisfied. In terms of $H$, an
arbitrary subset of rows can be processed at the same time
provided that no two or more rows have a 1 in the same column of
$H$. This subset of rows is termed a row-layer (hereafter
referred to as a layer). In other words, given a set
$\mathcal{L} = \{L_1, L_2, \ldots, L_I\}$ of $I$ layers in $H$,
$\forall u \in \{1, 2, \ldots, I\}$ and
$\forall i, i' \in L_u$ with $i \neq i'$,
$\mathcal{N}(i) \cap \mathcal{N}(i') = \emptyset$.

Observing that $\sum_{u=1}^{I} |L_u| = m$, in general $L_u$ can
be any subset of rows as long as the rows within each subset
satisfy the condition in Fact 2, implying that
$|L_u| \neq |L_{u'}|$, $\forall u, u' \in \{1, 2, \ldots, I\}$,
is possible. Owing to the structure of QC-LDPC codes, the choice
of $|L_u|$ (and hence $I$) becomes obvious. The submatrices $I_s$
in $H$ (with row and column weight of 1) guarantee that the $z$
CNs corresponding to the rows of $I_s$ always satisfy the
condition in Fact 2. Hence, $|L_u| = |L_{u'}| = z$ is chosen.

From the VN or column perspective, $|L_u| = z$,
$\forall u \in \{1, 2, \ldots, I\}$, implies that the columns of
$H$ are also divided into subsets of size $z$ (hereafter referred
to as block columns) given by the set
$\mathcal{B} = \{B_1, B_2, \ldots, B_J\}$, $J = n/z = n_b$.
Observing that VNs belonging to a block column may participate in
CN equations across several layers, we further divide the block
columns into blocks, where a block is the intersection of a layer
and a block column. Two or more layers $L_u$, $L_{u'}$ are said
to be dependent with respect to the block column $B_w$ if
$H_b(u, w) \neq -1$ and $H_b(u', w) \neq -1$, and are said to be
independent otherwise.
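The dependence test can be written directly against a base matrix, reading the condition as both layers holding a valid (non -1) entry in the same block column. The 4 x 3 matrix `HB_TOY` below is a hypothetical example rather than the matrix of Table I, and indices are 0-based in the code.

```python
# Hypothetical 4 x 3 base matrix; -1 marks an all-zero (invalid) block.
HB_TOY = [
    [ 2, -1,  0],
    [-1,  1, -1],
    [ 3,  4, -1],
    [-1, -1,  1],
]

def layers_in_block_column(Hb, w):
    """Layers (0-based rows of Hb) that hold a valid block in block column w."""
    return [u for u, row in enumerate(Hb) if row[w] != -1]

def are_dependent(Hb, u, u2, w):
    """True if layers u and u2 both have a valid block in block column w."""
    return Hb[u][w] != -1 and Hb[u2][w] != -1

for w in range(3):
    print(w, layers_in_block_column(HB_TOY, w))
print(are_dependent(HB_TOY, 0, 2, 0), are_dependent(HB_TOY, 0, 1, 0))
```

Any two layers returned together by `layers_in_block_column` must update the VNs of that block column one after the other.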
[Table III body: a section of $H_b$ with arrows marking the flow
of message updates between dependent layers; the arrows leaving
the last rows wrap around to $L_4$, $L_2$ and $L_5$.]

Table III: Illustration of message passing in row-layered
decoding in a section of the PCM $H_b$.


For example, in Table III we can see that layers $L_4$, $L_7$,
$L_{10}$ and $L_{11}$ are dependent with respect to block column
$B_2$. Assuming that the message update begins with layer $L_1$
and proceeds downward, the arrows represent the directional flow
of message updates from one layer to another. Thus, layer $L_7$
cannot begin updating the VNs associated with block column $B_2$
before layer $L_4$ has finished updating messages for the same
set of VNs, and so on. The idea of parallelizing $z$ NPUs seen in
Section III-B can be extended to layers: NPU arrays can process
message updates for multiple independent layers. It is clear that
dependent layers limit the degree of parallelization available to
achieve high throughput. In Section III-E we discuss pipelining
methods that allow us to overcome layer-to-layer dependency and
improve throughput.


D. Compact Representation of $H_b$

Before we discuss the pipelined processing of layers, we present
a novel compact (and thus efficient) matrix representation
leading to a significant improvement in throughput. To understand
this, let us call the $0$ submatrices in $H$ invalid blocks,
where there are no edges between the corresponding CNs and VNs,
and the submatrices $I_s$ valid blocks. In a conventional
approach to scheduling (for example in [7]), message computation
is done for all the valid and invalid blocks. To avoid processing
invalid blocks, we propose an alternate representation of $H_b$
in the form of two matrices: $\beta_I$ (Table IV), the block
index matrix, and $\beta_S$ (Table V), the block shift matrix.
$\beta_I$ and $\beta_S$ hold the index locations and the shift
values (and hence the connections between the CNs and VNs)
corresponding to only the valid blocks in $H_b$, respectively.
Construction of $\beta_I$ is based on the following definition.

Table IV: Block index matrix $\beta_I$ showing the valid blocks
(highlighted) to be processed.

Definition 3. Construction of $\beta_I$ is as follows:
for $u = \{1, 2, \ldots, m_b\}$
    set $w = 0$
    for $j_b = \{1, 2, \ldots, n_b\}$
        if $H_b(u, j_b) \neq -1$
            $w = w + 1$; $\beta_I(u, w) = j_b$; $\beta_S(u, w) = H_b(u, j_b)$.
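A runnable rendering of this construction is sketched below, assuming the loops scan each row of the base matrix once and keep only the valid entries. The function name and the 2 x 4 matrix `HB_SMALL` are hypothetical; block-column indices are stored 1-based to follow the text, and rows with fewer valid blocks are simply left shorter, since any padding policy is a separate design choice not made here.

```python
def compact_representation(Hb):
    """Build the block index and block shift matrices from a base matrix.

    Hb uses -1 for invalid (all-zero) blocks. Returns (beta_i, beta_s),
    two lists of rows holding, for valid blocks only, the 1-based
    block-column index and the shift value.
    """
    beta_i, beta_s = [], []
    for row in Hb:                              # one layer at a time
        idx_row, shift_row = [], []
        for jb, s in enumerate(row, start=1):   # block columns 1..n_b
            if s != -1:                         # keep valid blocks only
                idx_row.append(jb)
                shift_row.append(s)
        beta_i.append(idx_row)
        beta_s.append(shift_row)
    return beta_i, beta_s

# Hypothetical 2 x 4 base matrix, not the one from the case study.
HB_SMALL = [
    [ 2, -1,  0, -1],
    [-1,  1, -1,  3],
]
beta_i, beta_s = compact_representation(HB_SMALL)
print(beta_i)
print(beta_s)
```

In this toy case each layer shrinks from four blocks to two, so a schedule driven by `beta_i` visits half as many blocks per layer.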

To observe the benefit of this alternate representation, let us
define the following ratio.

Definition 4. Let $\lambda$ denote the compaction ratio, which is
the ratio of the number of columns of $\beta_I$ (which is the
same for $\beta_S$) to the number of columns of $H_b$. Hence,
$\lambda$ = (number of columns of $\beta_I$) / $n_b$.

The compaction ratio $\lambda$ is a measure of the compaction
achieved by the alternate representation of $H_b$. Compared to
the conventional approach, scheduling as per the $\beta_I$ and
$\beta_S$ matrices improves throughput by $1/\lambda$ times. In
our case study, $\lambda = \frac{8}{24} = \frac{1}{3}$, thus
providing a throughput gain of $\frac{1}{\lambda} = 3$.

Remark 2. In the irregular QC-LDPC code in our case study, all
layers comprise 7 blocks each, except layers $L_7$ and $L_{12}$,
which have 8. With the aim of minimizing hardware complexity by
maintaining a static memory-address generation pattern (one that
does not change from layer to layer), our implementation assumes
regularity in the code. The decoder processes 8 blocks for each
layer of the $\beta_I$ matrix, resulting in some throughput
penalty. The results from processing the invalid blocks (present
in all layers except $L_7$ and $L_{12}$) are not stored in the
memory.

Layer    b1   b2   b3   b4   b5   b6   b7   b8
L1       57   50   11   50   79    1    0   -1
L2        3   28    0   55    7    0    0   -1
L3       30   24   37   56   14    0    0   -1
L4       62   53   53    3   35    0    0   -1
L5       40   20   66   22   28    0    0   -1
L6        0    8   42   50    8    0    0   -1
L7       69   79   79   56   52    0    0    0
L8       65   38   57   72   27    0    0   -1
L9       64   14   52   30   32    0    0   -1
L10      45   70    0   77    9    0    0   -1
L11       2   56   57   35   12    0    0   -1
L12      24   61   60   27   51   16    1

Table V: Block shift matrix $\beta_S$ showing the right-shift
values for the valid blocks to be processed.

E. Layer-Pipelined Decoder Architecture

In Section III-C we saw that dependent layers for a block column
cannot be processed in parallel. For instance, in $H_b$ in
Table I, VNs associated with the block column $B_1$ participate
in CN equations associated with all the layers except layer
$L_{10}$, suggesting that there is no scope for parallelization
of layer processing at all. This situation is better observed in
$\beta_I$ shown in Table IV.

Fact 3. If a block column of $\beta_I$ has a particular index
value appearing in more than one layer, then the layers
corresponding to that value are dependent with respect to that
block column.

Proof: Follows directly by applying Fact 2 to Definition 3. ■

In other words, $\forall u, u' \in \{1, 2, \ldots, I\}$,
$\forall w \in \{1, 2, \ldots, J\}$, if
$\beta_I(u, w) = \beta_I(u', w)$ then the layers $L_u$ and
$L_{u'}$ are dependent. It is obvious that, to process all layers
in parallel ($L_1$ to $L_{12}$ in Table IV), the condition

$$ \beta_I(u, w) \neq \beta_I(u', w) \tag{10} $$

must hold $\forall u, u' \in \{1, 2, \ldots, I\}$, $u \neq u'$.
We call a set of layers $\mathcal{L}$ that is independent in the
sense of Fact 3, i.e. that satisfies (10), a superlayer. As will
be seen later, the formation of superlayers of suitable size is
crucial to achieve parallelism in the architecture.
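Whether a group of layers forms a superlayer can be checked mechanically against a block index matrix. The sketch below tests, column by column, that no block-column index repeats among the chosen layers, which is the reading of condition (10) assumed here. The matrix `BETA_I_TOY` and the function name are hypothetical, and layers are addressed by 0-based row index.

```python
def is_superlayer(beta_i, layers):
    """True if, in every column position of beta_i, the block-column
    indices of the given layers (0-based row indices) are all distinct."""
    n_cols = max(len(beta_i[u]) for u in layers)
    for w in range(n_cols):
        seen = set()
        for u in layers:
            if w >= len(beta_i[u]):
                continue              # this layer has no block at position w
            if beta_i[u][w] in seen:
                return False          # same block column met at the same step
            seen.add(beta_i[u][w])
    return True

# Hypothetical block index matrix with three layers of three blocks each.
BETA_I_TOY = [
    [1, 5, 9],
    [2, 5, 7],
    [3, 6, 9],
]
print(is_superlayer(BETA_I_TOY, [0, 1]))   # index 5 repeats in the second column
print(is_superlayer(BETA_I_TOY, [1, 2]))
```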


Table VI: Rearranged block index matrix $\beta'_I$ used for our
work, showing the valid blocks (highlighted) to be processed.


The idea is to rearrange the $\beta_I$ matrix elements from their
original order. If $\beta_I(u, w) = \beta_I(u', w)$ with
$u < u'$, then stagger the execution of $\beta_I(u', w)$ with
respect to $\beta_I(u, w)$ by placing $\beta_I(u', w)$ in
$\beta'_I(u', w')$ such that $w < w'$. To understand how layers
are pipelined, let us first look at the non-pipelined case.

Without loss of generality, Fig. 2(a) shows the block-level view
of the NPU timing diagram without the pipelining of layers. As
seen in Section III-A, the GNPU and LNPU operate in tandem and in
that order, implying that the LNPU has to wait for the GNPU
updates to finish. The layer-level picture is depicted in
Fig. 3(a). We call this version the 1x version. This idling of
the GNPU and LNPU can be avoided by introducing pipelined
processing of blocks, as given by the following Lemma.


Lemma 1. Within a superlayer, while the LNPU processes messages
for the blocks $\beta'_I(u, w)$, the GNPU can process messages
for the blocks $\beta'_I(u + 1, w)$,
$u = \{1, 2, \ldots, |\mathcal{L}| - 1\}$ and
$w = \{1, 2, \ldots, J\}$.

Proof: Follows directly from the layer independence condition in
Fact 2. ■

Fig. 2(c) illustrates the block-level view of this 2-layer
pipelining scheme. It is important to note that the splitting of
the NPU process into two parts, namely the GNPU and the LNPU
(that work in tandem), is a necessary condition for Fact 3 (and
hence Lemma 1) to hold. However, at the boundary of the
superlayer, Lemma 1 does not hold and pipelining has to be
restarted for the next layer, as seen in the layer-level view
shown in Fig. 3(c). We call this version the 2x version. This is
the classical pipelining overhead. In the following, we impose
certain constraints on the size of the superlayers in $H$.


Definition 5. Without loss of generality, the pipelining
efficiency $\eta_p$ is the number of layers processed per unit
time per NPU array.

For the case of pipelining two layers shown in Fig. 3(c),

$$ \eta_p^{(2)} = \frac{|\mathcal{L}|}{|\mathcal{L}| + 1} \tag{11} $$


Thus, we impose the following conditions on $|\mathcal{L}|$:

1) Since two layers are processed in the pipeline at any given
time, provided that $I$ is even,

$$ |\mathcal{L}| \in \mathcal{F} = \{x : x \text{ is an even factor of } I\}. $$

It is important to note that, for any value of
$|\mathcal{L}| \in \mathcal{F}$, $\mathcal{L}$ must be a
superlayer.

2) Given a QC-LDPC code, $|\mathcal{L}|$ is a constant. This is
to facilitate a symmetric pipelining architecture, which is a
scalable solution.

3) The choice of $|\mathcal{L}|$ should maximize the pipelining
efficiency $\eta_p$:

$$ l^* = \arg\max_{|\mathcal{L}| \in \mathcal{F}} \eta_p $$

Case Study. Table VI shows one such rearrangement of $\beta_I$
for the QC-LDPC code for our case study in Table IV. Unresolved
dependencies are shown in blue in Table VI. Here
$I = m_b = 12$, $\mathcal{F} = \{2, 4, 6\}$ and
$l^* = \arg\max_{|\mathcal{L}| \in \mathcal{F}} \eta_p = 6$. The
rearranged block index matrix $\beta'_I$ is shown in Table VI and
the layer-level view of the pipeline timing diagram for the same
is shown in Fig. 3(d).
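The selection of the superlayer size in the case study can be reproduced with a few lines of arithmetic. The sketch assumes the two-layer efficiency has the form $|\mathcal{L}| / (|\mathcal{L}| + 1)$ and that the candidate set contains the even factors of the layer count excluding the count itself, which is what yields the set {2, 4, 6} for 12 layers. It does not verify that layers of the chosen size actually form superlayers; the function names are hypothetical.

```python
from fractions import Fraction

def pipelining_efficiency(size):
    """Two-layer pipelining efficiency, assumed to be |L| / (|L| + 1)."""
    return Fraction(size, size + 1)

def candidate_sizes(num_layers):
    """Even factors of the layer count. The full count itself is left out,
    an assumption made so that 12 layers give the set {2, 4, 6}."""
    return [x for x in range(2, num_layers, 2) if num_layers % x == 0]

sizes = candidate_sizes(12)
best = max(sizes, key=pipelining_efficiency)
print(sizes, best, round(float(pipelining_efficiency(best)), 2))
```

Because the efficiency grows with the superlayer size, the largest admissible candidate is always selected, and the one-slot restart at each superlayer boundary accounts for the gap to 1.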


High-level FPGA-based Decoder Architecture: The high-level
decoder architecture is shown in Fig. 4. The ROM holds the LDPC
code parameters specified by the $\beta'_I$ and $\beta'_S$
matrices, along with other code parameters such as the block
length and the maximum number of decoding iterations. The APP
memory is initialized with the channel LLR values corresponding
to all the VNs as per equation (1). The barrel shifter operates
on blocks of VNs (APP values in equation (4)) of size
$z \times f$, where $f$ is the fixed-point word length used in
the implementation for APP values. It circularly rotates the
values to the right by using the shift values from the
$\beta'_S$ matrix in the ROM, effectively implementing the
connections between the CNs and VNs. The cyclically shifted APP
memory values and the corresponding CN message values for the
block in question are fed to the NPU arrays. Here, the GNPUs
compute VN messages as per equation (2) and the LNPUs compute CN
messages as per equation (3). These messages are then stored back
at their respective locations in the RAMs for processing the next
block.
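The barrel shifter's operation on one block can be sketched as a circular right rotation of a list of $z$ values. This is a software illustration only, with a hypothetical function name and made-up values; how rotated positions map onto NPU inputs depends on indexing conventions that are not spelled out here.

```python
def barrel_shift_right(values, s):
    """Circularly rotate a block of z values to the right by s positions."""
    z = len(values)
    s %= z                            # a shift of z is a full turn
    return values[-s:] + values[:-s] if s else list(values)

app_block = [10, 11, 12, 13, 14]      # made-up APP values for a z = 5 block
print(barrel_shift_right(app_block, 2))
```

Each element at position `i` moves to position `(i + s) mod z`, so a single shift value per valid block is enough to describe all `z` connections of that block.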


IV. Case Study 

To evaluate the proposed strategies for achieving high
throughput, we have implemented the scaled-MSA based decoder for
the QC-LDPC code in the IEEE 802.11n (2012) standard. For this
code, $m_b \times n_b = 12 \times 24$ and $z = 27$, 54 and 81,
resulting in code lengths of $n = 24 \times z = 648$, 1296 and
1944 bits, respectively. Our implementation supports the
submatrix size of $z = 81$ and hence is capable of supporting all
the block lengths for the rate $R = 1/2$ code. At the time of
writing this paper, we have successfully implemented the two
aforementioned versions.

1) 1x: The block-level and the layer-level views of the timing
are illustrated in Fig. 2(b) and Fig. 3(b), respectively.

2) 2x: Pipelining is done in software at the algorithmic
description level. The block-level and layer-level views of the
pipelined processing are shown in Fig. 2(d) and Fig. 3(d),
respectively. With an efficiency $\eta_p = 0.86$, the 2x version
is 1.7 times faster than the 1x version.

We represent the input LLRs from the channel and the CTV and VTC
messages with 6 signed bits and 4 fractional bits. Fig. 5 shows
the bit-error rate (BER) performance for the floating-point and
the fixed-point data representations with 8 decoding iterations.
As expected, the fixed-point implementation suffers a loss of
about 0.5 dB compared to the floating-point version. The decoder
algorithm was described using the LabVIEW CSDS software. The FPGA
IP compiler was then used to generate the VHDL code from the
graphical dataflow description. The VHDL code was synthesized,
placed and routed using the Xilinx Vivado compiler on the Xilinx
Kintex-7 FPGA available on the NI PXIe-7975R FPGA board. The
decoder core achieves an overall throughput of 608 Mb/s at an
operating frequency of 200 MHz and a latency of 5.7 µs.
Table VII shows that the resource usage for the 2x version
(almost twice as fast due to pipelining) is close to that of the
1x version. The FPGA IP compiler chooses to use more FFs for data
storage in the 1x version, while it uses more BRAM in the 2x
version. Compared to a contemporary FPGA-based implementation in
[14] using a high-level algorithmic description compiled to an
HDL, our implementation achieves a higher throughput with lower
resource utilization. The authors of [14] have implemented a
decoder for an $R = 1/2$, $n = 648$, IEEE 802.11n (2012) code
that achieves a throughput of 13.4 Mb/s at 122 MHz and utilizes
2% of slice registers, 3% of slice LUTs and 20.9% of block RAMs
on the Spartan-6 LX150T FPGA, with a comparable BER performance.

2.06 Gb/s LDPC Decoder [15]: An application of this work has been
demonstrated at IEEE GLOBECOM'14, where the QC-LDPC code for our
case study was decoded with a throughput of 2.06 Gb/s. This
throughput was achieved by using five decoder cores in parallel
on the Xilinx K7 (410t) FPGA in the NI USRP-2953R.


V. Conclusion 

In this brief we have proposed techniques to achieve
high-throughput performance for an MSA-based decoder for QC-LDPC
codes. The proposed compact representation of the PCM provides
significant improvement in throughput. An IEEE 802.11n (2012)
decoder is implemented, which attains a throughput of 608 Mb/s
(at 260 MHz) and a latency of 5.7 µs on the Xilinx Kintex-7 FPGA.
The FPGA IP compiler greatly reduces prototyping time and is
capable of implementing complex signal processing algorithms.
There is undoubtedly more scope for improvement; nevertheless,
our current results are promising.

Figure 2: Block-level view of the pipeline timing diagram. (a)
General case for a circulant-1 identity submatrix construction
based QC-LDPC code (see Section II) without pipelining. (b)
Special case of the IEEE 802.11n QC-LDPC code used in this work
without pipelining. (c) Pipelined processing of two layers for
the general QC-LDPC code case in (a). (d) Pipelined processing of
two layers for the IEEE 802.11n QC-LDPC code case in (b).

Figure 3: Layer-level view of the pipeline timing diagram. (a)
General case for a circulant-1 identity submatrix construction
based QC-LDPC code (see Section II) without pipelining. (b)
Special case of the IEEE 802.11n QC-LDPC code used in this work
without pipelining. (c) Pipelined processing of two layers for
the general QC-LDPC code case in (a). (d) Pipelined processing of
two layers for the IEEE 802.11n QC-LDPC code case in (b).

[Figure 4 legend] z: submatrix size; n: codeword length; N:
number of columns in $H_b$; $d_c$: check node (CN) degree; f:
implementation word length.

Figure 4: High-level decoder architecture, showing the z-fold
parallelization of the NPUs with an emphasis on the splitting of
the sign and the minimum computation given in equation (3). Note
that other computations in equations (1)-(4) are not shown here
for simplicity. For both the pipelined and the non-pipelined
versions, the processing schedule for the inner Block Processing
loop is as per Fig. 2 and that for the outer Layer Processing
loop is as per Fig. 3.

Figure 5: Bit error rate (BER) performance comparison between
uncoded BPSK (rightmost), rate = 1/2 LDPC with 4 iterations using
fixed-point data representation (second from right), rate = 1/2
LDPC with 8 iterations using fixed-point data representation
(third from right), and rate = 1/2 LDPC with 8 iterations using
floating-point data representation (leftmost).

                     1x               2x
Device               Kintex-7 k410t   Kintex-7 k410t
Throughput (Mb/s)    337              608
FF (%)               9.1              5.3
BRAM (%)             4.7              6.4
DSP48 (%)            5.2              5.2
LUT (%)              8.7              8.2

Table VII: LDPC decoder IP FPGA resource utilization and
throughput on the Xilinx Kintex-7 FPGA.